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ABSTRACT 

This  report  describes  the  research  activities  performed  by  the 
University  of  Southern  California,  Naval  Ocean  System  Center,  Hughes 
Research  Laboratories,  University  of  California,  San  Diego,  Stanford 
University  and  Integrated  Systems,  Inc.,  under  the  the  SRO  project, 
for  the  period  1  March  1981  to  28  February  1982  under  contract  No.: 
N000114-81-K-0191  with  the  Office  of  Naval  Research.  The  research 
activities  have  focussed  on  the  VLSI  signal  processing  theory  and 
algorithms  and  the  development  of  parallel  computing  architectures. 

A  solution  in  today's  VLSI  research  challenge  lies  in  a  cross- 
disciplinary  research  encompassing  the  areas  of  mathematics,  algorithms, 
computers  and  applications.  To  this  end,  this  report  summarizes  two 
parallel  major  research  tasks:  (1)  Signal  processing  algorithm  and 
theory  -  emphasizing  spectral  analysis  and  its  applications;  and 
(2)  parallel  computing  structures  -  utilizing  VLSI  potential  for  high¬ 
speed  signal  processing. 
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This  repori  describes  cur  research  activities  under  the  SFO 
Project  for  the  period  Parch  1,  1  Q°1  tc  February  2P,  1?S2.  The 
research  activities  have  focussed  on  the  study  of  VL FI  signal 
processing  theory  and  algcri  Lhms  and  the  develcpnent  of  parallel 
computing  architectures. 

With  the  rapidly  growing  microelectronics  technology  lead  in®  the 
way,  modern  signel  processing  is  undergoing  a  major  revolution.  The 
availability  cf  lew  cost,  fast  VLSI  devices  premises  the  practice  of 
increasingly  complex  and  sophisticated  algorithms  and  systems. 
However,  in  conjunction  with  such  promise,  there  is  accompanied  a  new 
challenge  of  how  to  update  the  signal  processing  techniques  sc  as  to 
effectively  utilize  the  large-scale  computation  capability.  The 
answer  to  this  challenge  lies  in  a  cross-disciplinary  research 
encompassing  the  areas  cf  mathematics,  algorithms,  computers  and 
applications.  Tc  this  end,  two  parallel  major  research  tasks  have 
been  undertaken  in  the  OK  H- SFO  research  group; 

(l  )  Signal  Processing  algorithm  and  theory  -  emphasizing  spectral 
analysis  and  its  applications;  and  (2)  parallel  computing  structures 
-  utilizing  VLSI  potential  for  high-speed  signal  processing. 

In  the  area  cf  signal  processing  theory  and  algorithms, 
significant  work  has  been  made  on  the  following  topics  vi  th  special 
emphasis  on  high  resolution  spectral  estimation:  Pecursive  least 

squares  ladder  form  algorithms  (ISI,  Stanford),  adaptive  least 
squares  lattice  (’JCSD),  Parallel  Kalman  filtering  for  both  linear  and 
non  linear  signal  processing  (ISI  Stanford),  adaptive  notch  filtering 
(’JSC),  KPI,  AEP’A,  and  II,  spectral  estimation  in  1-D  and  2-S  (JSC, 


3 


Ftanford,  and  IFI),  and  Tceplitz  approx  ima  Lien  O.JPC).  Parallel 
impleracn ia lien  of  these  algorithms  has  been  a  major  consideration  in 
their  development. 

In  the  complimentary  area  of  parallel  computing  structures,  both 
dedicated  and  flexible  architectures  have  been  developed  for  signal 
processing  tasks  and  applications.  Works  in  progress  include: 
Tceplitz  system  solver  using  pipelined  Levinson  and  implements  tier 
(U  PC  and  Hughes),  programmable  VavefrcnL  A rray  Processor  and  data 
flew  language  for  VLPI  signal  processing  elgcrit^ms  f'lPC),  Pysiclic 
Arrays  for  real-time  signal  processing  applications  in  Fpectrun 
Analysis  and  Direction  find  ing(NCPP) ,  and  implementation  of  Fadeev 
algorithm  (Hughes)  and  Pistol  ic  architectures  for  ladder  forms  and 
parallel  Kalman  filters  (Ptanford  and  I  FT). 

A  brief  summary  of  the  technical  work,  grouped  in  terms  of 
research  and,  is  described  in  the  following  sections. 

FFCTION  1  :  FTCiJAL  °R0CFPT!I0  ALT 0 PT '■’!!*? F  A’CD  THFOFY 

Pec .  1.1  :  adaptive  spectral  estimation; 

Pec.  1.2:  sigral  processing  theory; 

Pec.  1.3:  parallel  signal  processing  algorithm; 

PFCTION  2:  FID  FLY  v  AP  ALL  FI  COIP'JTTKD  PTFUCTJF.FP 

Pec.  2.1:  dedicated  signal  processing  architectures; 

Pec.  2.2:  array  processors;  and 

Pec.  2.7:  VLPI  implementation  of  signal  processing 

archi tec  lures . 

Following  this  summary  are  the  progress  reports  from  individual 


insti lutes 


1.  Signal  Processing  Algorithms  and  Theory 


As  to  the  first  research  front,  it  hinges  upon  a  thorough,  in- 
depth  understanding  cf  mathematics  and  algorithm  analysis.  Tn 
addition  to  the  classical  mathematical  techniques  such  as  Fourier 
transform,  linear  dynamic  systems,  random  process,  etc.  there  arises 
a  new  signal  processing  mathematics  branch  which  can  be  grossly 
termed  as  modern  spectral  analysis.  Fxplicitly  or  r.ot,  a  large  class 
cf  signal  processing  applications  have  had  extensive  use  cf  this 
analysis  as  a  technical  basis.  'rherefcre,  our  research  effort  aims 
at  developing  a  theoretical  and  algorithmic  basis  'or  modern  spectral 
analysis  me thcds  and  signal  troc^ssing  applications. 


1.1  Adaptive  Spectral  Fstimaticn  Methods 


1.1.1  Adaptive  Notch  Filtering  (UFC  f  1  —3 1 ) 

Using  a  steady  state  frequency  domain  approach,  a  new  method  has 
been  developed  for  the  retrieval  cf  sinusc id s/ narrowband  signals  ir. 
additive  noise  colored  or  white.  The  method  suggested  has  been  shown 
to  require  smaller  filter  length  to  produce  unbiased  estimates, 
compared  to  the  existing  autoregressive  method.  For  its 

implementation,  a  pole-zero  filter  where  the  feedback  and  feedforward 
coefficients  are  related  (constrained  AF?'/),  has  been  developed.  A 
study  cf  the  performance  and  implementations!  aspects  of  the  filter 
have  been  undertaken.  The  details  cf  this  newly  developed  nre 


discussed  in  the  full  report.  Fbr  a  stable  implementation,  parallel 
and  cascade  forms  have  been  shewn  to  be  useful.  A  parallel 
processing  scheme  developed  shews  great  premise. 


1.1.2  Adaptive  Least  Squares  Ladder  Form  Algorithm  (U  C5D  [  17-1 8], 

I  SI  and  Stanford!” 20-22]) 

The  gradient  methods  of  adaptive  filter  in  piemen  La  Li  or.  require 
various  adaptive  power  estimates  to  be  made.  The  performance 
sensitivity  of  the  gradient  methods  to  different  time  constants  of 
the  adaptive  power  estimation  loops  has  been  under  study  .a  l  UCfT. 
For  the  specific  case  of  underlying  frequency  versus  tine  dynamics 
consisting  of  dual  steps,  the  simulations  presented  ir  the  report 
investigate  this  sensitivity  in  the  context  of  a  frequency  tracking 
problem.  Rased  on  ar.  intensive  study  on  both  the  eigenvalues  and 
singular  values  of  the  au tcccrrela ticn  matrix  arising  from  complex 
data,  adaptive  lattice  structures  appear  to  show  significant 
performance  advantages  ever  their  direct  form  counterparts,  due  to  an 
insensitivity  to  eigenvalue  spread.  A  documentation  task  on  the 
eigenvalue  spread  present  both  in  future  controlled  sin  u.h  lions ,  as 
well  as  in  actual  sonar  data,  is  in  progress. 

Tn  addition,  a  recursive  le8S t-squares  ladder- form  algorithm  for 
predicted  residuals  rather  than  filtered  residuals  has  been  derived 
at  TFI  and  ftanferd.  The  reflection  coefficients  and  the  order 
updates  cf  the  residuals  in  the  new  algorithm  are  computed 
simultaneously.  This  formulation  improves  the  throughput  rate  and 


numerical  suability  cf  existing  recursive  least-squares  algorithms. 

1.2  Signal  Processing  Theory 

1.2.1  2-D  Spectral  Fstimation  (U SC  f  1 3 ] ) 

Our  recent  research  has  been  concerned  vi Lh  developing  systematic 
methods  for  2-D  spectral  esLimaiicr.  from  raw  data  using  random  field 
models.  V.’e  assume  that  the  eiver.  finite  data  is  represented  by  an 
appropriate  Gaussian  Farkcv  random  field  ff'FF)  model. 

Fy  usin^  specific  finite  toroidal  lattice  representations  and 
Gaussian  maximum  likelihood  estimates  we  have  developed  new  2-D 
spectral  estimates.  It  turns  cut  that  the  FFF  spectrum  is  also  the 
maximum  likelihood  spectrum  arising  in  frequency-wave  number 
analysis.  Furthermore,  the  sample  correlation  values  of  the  given 
observations  in  an  array  R  are  in  perfect  agreement  with  the 

estimated  theoretical  correlations  in  V  obtained  by  Fourier 
inverting  the  FFF  spectrum.  Thus  the  ?"PF  spectrum  developed  by  us 
converges  to  the  2-D  maximum  entropy  soectral  estimate 

asymptotically.  Currently  we  have  begun  investigations  on  parallel 
implementation  cf  the  algorithms  for  2-D  spectral  estimation. 
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1.2.2  Fela tionships  Between  Several  Popular  Methods  for  Spectral 
Estimation  and  >rray  Processing 

It  may  seem  tcc  ambitious  to  compare  all  currently  popular  high- 
resoluticn  spectral  estimation  methods.  For  example,  while  maximum 
entropy  method  related  to  autoregressive  modeling  is  recievir.g  a 
tremendous  popularity,  it  may  suffer  from  bias  and  resolution 
problems  when  additive  noise  is  non-negligible.  On  the  other  hand, 
Pisarenko's  method  based  on  sinusoidal  modeling  en.icvs  relatively 
better  perfomance  in  the  presence  of  noise  buL  in  general  suffers 
from  numerical  sensitivity  problems.  However,  from  a  different 
perspective,  Pisarenko's  method  can  he  viewed  as  an  extension  of  the 
KF?\  method  with  the  removal  of  the  noise  contribution.  Therefore,  an 
attempt  is  being  made  at  developing  a  unified  framework  for  Lhe 
spectral  analysis  techniques.  Moreover,  the  unification  attempt  is 
being  extended  to  the  counterpart  of  spectral  analysis  in  array 
processing  application.  Though  the  covariance  matrix  will  no  longer 
have  a  Tcepliiz  structure  and  the  phasing  vectors  .are  mere  complex  in 
array  processing  situations,  we  are  convinced  that  the  general 
principles  remain  largely  applicable.  We  are  currently  locking  into 
theoretical  and  computational  relevances  between  several  modern  array 
processing  and  spectrum  estimation  methods. 


R 

1.2.3  Toeplitz  Approximation  Method  (USC[4]) 

Recently,  the  study  cn  npproxima tier  theory  and  its  application? 
has  received  considerable  attention.  In  cur  work,  a 

narrowband /sinusoidal  signal  retrieval  problem  is  formulated  in  terns 
of  approximation  of  Toeplitz  autcccvariance  matrix.  P  Toeplitz 
apprex ima ticn  method  based  cn  singular  valu°  decomposition  is 
proposed  and  simulation  results  indicate  some  improvement  ever  seme 
previously  proposed  methods. 

1.2.4  Scattering  Theory  (ISI  and  Stanfcrdf 20] ) 

This  task  has  .just  been  started  and  cur  hope  is  to  use  scattering 
theory  to  decouple  large-scale,  2 -D  signal  processing  problems  into 
blocks  which  can  be  processed  simultaneously,  foattering  theory  will 
also  provide  physical  insight  into  the  implementation  of  ladder  forms 
and  parallel  Kalman  filters. 

1.3  Parallel  Algorithms 


1.3.1  Parallel  Algorithms  for  Image  Processing  and  Analysis.  (U  SC) 

Our  recent  research  at  Image  "recessing  Insti  lute, UPC,  has  beer, 
concerned  with  parallel  algorithms  for  image  processing  and  image 
analysis.  Test  of  the  effort  has  been  concerned  with  parallel 
implementation  of  nensta ticnary  adaptive  image  restoration. 


Ferursive  and  ncn-recursive  implementation  of  locally  adaptive 


restoration  has  been  studied.  These  techniques  estimate  the  local 
nonstationary  mean  and  variance  of  ideal  scenes  frcm  degraded  data. 
Most  blurring  degradations  are  also  highly  local,  so  that  local 
parallel  processing  combined  witn  the  ncns ta ticnary  image  model  data 
can  be  used  to  minimize  local  mean-square  errc  (I'.FF)  in  a  parallel 
fashion.  V.’e  have  shewn  that  local  KSF  is  not  a  bad  error  criterion 
for  image  processing,  as  opposed  to  the  usual  global  T'FF  token  over 
the  entire  scene.  Global  VfF  often  dees  net  correlate  well  with 
human  observer  judgments  of  image  quality. 

We  have  locked  m  the  application  of  these  techniques  tc  systems 
with  coherent  speckle  ncise,  such  as  synthetic  aperature  ff/p) 
imagery,  coherent  sonar  and  acoustic  imaging.  Roth  recursive 
(Kalman-like)  and  local  sectioned  parallel  implements  tiens  are  being 
studied  in  detail. 

In  addition,  we  have  began  investigations  cn  parallel  feature 
extraction  for  texture  identification  and  texture  segmen  te  tier. . 


1.3.2  A  Parallel  Algorithm  for  Solving  Toeplitz  System  (U SC  f6-7l) 

We  have  developed  a  parallel  algorithm  for  solving  a  Toeplitz 
system  Tx  -  y  where  T  is  a  Toeplitz  matrix,  i.e.,  (t1^  “  t^_^ 

"  tk’  k  £  N.  Tn  general,  solving  an  N  by  K  linear 

system  takes  steps  of  operations.  In  contrast,  the 


Levinson  algorithm  effectively  utilizes  the  Toeplitz  structure  tc 


reduce  the  overall  computation  to  0(N’**2')  operations.  '’Tie  Levinson 
procedure,  however,  has  to  call  upon  an  inner  product  operation  to 
compute  the  vital  reflection  coefficients.  In  order  to  achieve  full 
parallelism,  we  have  to  further  exploit  the  Toeplitz  structure.  For 
this  purpose,  we  have  proposed  a  new,  pipelined  version  of  the 
Levinson  algorithm  which  allows  the  reflection  coeffecients  to  be 
computed  in  a  pipelined  fashion.  This  avoids  the  need  of  the  inner 
product  operations,  and  the  total  computing  time  is  therefore  reduced 
to  0(N). 

1.3.3  Tceplitz  Eigenvalue  Computation  (USC[5}) 

This  research  task  deals  with  the  parallel  computation  of  the 
minimum  eigenvalue  of  a  Ice pi i 12  matrix.  The  minimum  eigenvalue  has 
an  important  interpretation  as  the  power  cf  additive,  whiLe  noise  to 
be  determined  in  a  noisy  statistical  environment.  In  many  high 
resolution  spectrum  analysis  problems  ,  the  estimation  and 

removal  cf  such  ncise  contribution  is  essential  for  unbiased 
estimates.  Our  objective  is  again  to  derive  an  0(N)  computation 
algorithm  to  estimate  the  minimum  eigenvalue  cf  a  given  Tceplitz 
covariance  matrix.  This  goal  can  be  accomplished  by  edcpting  th» 
pipelined  Toeplitz  computing  structure  discussed  earlier  and  s 
careful  utilization  cf  a  relationship  between  the  minimum  eigenvalue 
and  the  redisues  F  that  arise  in  the  Levinson 

algorithm.  Fased  on  this  relationship,  a  fast  iterative  procedure  is 
developed  to  successively  estimate  the  minimum  eigenvalue.  Based  on 
simulation  results  for  such  an  application,  seme  improvements  are 


as  accuracy  cf 


observed  in  both  ihe  computing  speed  as  well 
estimates.  A1  though  much  mere  computational  complexity  analysis  is 
yet  to  be  demonstrated,  ve  are  convinced  that  this  approach  will  have 
a  major  in  future  applications  cf  high  speed,  high  resolution 
spectrum  estimation  problems. 


1.3.4  Applications  of  SVD  to  Signal  Processing  (NOSC,  USC) 

It  is  well  known  that  fVD  can  be  used  in  many  sigral  processing 
applications.  Therefore  parallel  (real-time)  implementation  has  been 
an  important  research  focus.  Tome  partial  results  are  offered  in  the 
report.  The  most  noteworthy  result  is  the  significant  numerical 
improvement  cf  f;Odb  in  terms  cf  dynamic  range  obtained  in  the 
computation  of  eigenvalue  of  R  -  A^A  via  TVD  cf  P.  ''his  approach 
is  being  extended  to  generalized  eigensystem  computation. 


1.3*5  Parallel  Kalman  Filter  Algorithms  (Stanford  and  I SI [25-27]) 

The  research  on  Parallel  Kalman  Filters  (15KF's)  has  been  divided 
into  two  major  tasks:  °FK's  for  linear  signal  processing 

applications  and  PKF's  for  nonlinear  signal  processing. 

For  linear  signal  procesing  applications,  the  predictor  and 
corrector  equations  in  the  Kalman  filter  can  be  computed  on  separate 
processors  simultaneously.  A  PKF  has  been  ceded  and  simulated.  A 


stability  analysis  of  the  PKF  is  currently  underway 
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For  nonlinear  signal  processing  applications  (such  as  spectrum 
estimation),  it  is  difficult  to  decouple  the  extended  Kalman  filter 
(FKF)  equations.  Therefore,  to  speed-up  computations  parallel 
predictor-corrector  (P^C)  methods  have  been  used  to  speed-up  the 

p 

linearization  process  in  the  FKF.  Fbr  example,  the  PC  methods  are 
2  -  100  times  faster  than  sequential  methods  give,n  2  -  100. 

processors.  /pplicaticns  to  maximum  likelihood  estimation  of 
sinusoidal  signals  in  wide-band  noise  has  also  been  studied. 


1.3.6  Parallel  Algorithms  for  Seismic  Signal  Processing  (USC[l2]) 

Parallel  Processing  techniques  for  generating  synthetic 
seismoerams  and  for  the  computation  of  the  output  of  a  horizontally 
stratified,  ncn-ahsorptive  medium  propagating  plane  waves  vertically; 


heve  been-stud ied . 
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2.  Highly  Parallel  Computing  Structures 

The  aforementioned  research  effort  cn  signal  prccesing  algorithm 
and  theory,  equipped  with  parallel  algorithms,  and  adaptive  on-line 
processing  techniques,  will  serve  as  a  useful  cornerstone  for  real¬ 
time  high  performance  signal  processing  area.  However,  the  real 
ma.icr  thrust  for  high-speed  signal  processing  lies  in  effective 
utilization  of  the  enormous  computation  capability  provided  by  the 
VLFI  circuits.  Therefore,  cur  cth^r  research  task  aims  to  bring  the 
revolutionary  VLFI  device  technology  to  an  effective  signal 
processing  application. 


2.1  Dedicated  Architectures  for  Signal  Processing 

2.1.1  Pipelined  Toeplitz  System  Solver  ('J  FC[6-?]) 

This  new  parallel  algorithm  for  solving  Toeplitz  system  can  be 
implemented  for  parallel  computation  with  full  compliance  with  the 
VLFI  communication  constraint.  Fpecif ical ly,  a  pipelined  processor 
architecture  with  0(tl)  crocesscrs  is  developed  which,  uses  only 
localized  interconnections  and  still  retains  the  maximum  parallelism 
a  ttainable. 

We  believe  that  the  proposed  pipelined  Toeplitz  system  solver 
[5,  6]  is  perhaps  the  most  efficient,  fast,  and  practical  (in  VLFI 
sense)  design  available  for  solving  Toeplitz  systems.  Moreover,  the 
design  methodology  danonstra ted  in  this  work  should  also  help  answer 
seme  fundamental  problems  faced  in  designing  cf  VLFI  parallel 
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processor  archi tec  Lures . 


2.1.2  Architectures  for  Ladder  Forms  and  Parallel  Kalman  Filters 
(ISl[25-27]) 

Ladder-form  architectures  for  implementing  the  recursive  least- 
squares  ladder-form  algorithm  have  been  developed. 

Fxisting  systolic  array  architectures  are  bein^  evaluated  to 
detemine  which  architectures  are  suitable  for  implementing  parallel 
Kalman  filters  (°KF's).  Pystclic  array  architectures  for  fhclesky 
decomposition  and  triangulari za ticn  have  been  considered  to  implement 
square- root  PKF's.  In  addition,  VLPT  architectures  based  on  mapping 
the  Kalman  filter  equations  directly  onto  the  processor  architectures 
have  been  examined. 


2.2  Array  Processors  for  Fignal  Processing 

2.2.1  Wavefront  Array  Processor  (USCf8-1l]) 

The  traditional  design  of  parallel  computers  and  languages  is  not 
very  suitable  for  the  design  of  VLFI  array  processors  for  signal 
processing.  VLFI  imposes  the  restrictions  cf  local  da ta-dependence 
and  recursivity  on  the  algorithms  that  can  be  handled  by  such  an 
array  processor.  Fuch  algorithms  can  be  viewed  as  a  sequence  of 
waves  (cf  data  and  computational  activity).  This  naturally  leads  to 
a  wavefront  based  programmable  computing  network,  which  we  call  the 


Wavefront  Array  Processor  (VJ A?). 

Our  contribution  hinges  upon  the  development  of  a  wavefron  t-bas.ed 
language  and  architecture  for  a  programmable  special  purpose 
multiprocessor  array.  Based  on  the  notion  of  computational 
wavefront , the  hardware  of  the  processor  array  is  designed  to  provide 
a  computing  medium  that  preserves  the  key  properties  of  the 
wavefront.  In  conjunction,  a  wavefronL  language  (MDIL)  is  introduced 
that  drastically  reduces  the  complexity  of  the  description  of 
parallel  algorithms  and  simulates  the  wavefront  propagation  across 
the  computing  network.  together,  the  hardware  and  the  language  lead 
tc  a  programmable  Wavefront  f rray  Processor  (WA?).  The  W/P  blends 
the  advantages  of  the  dedicated  systolic  array  and  the  general 
purpose  Data-Flow  machine  and  provides  a  powerful  tool  for  the  high 
speed  execution  of  a  large  class  of  matrix  operations  and  related 
algorithms  which  have  widespread  applications. 

2.2.2  Array  Processor  Testbed  (S0SC[l6]) 

fystclic  architectures  have  been  developed  for  signal  processing 
tasks  and  systolic  implementations  have  been  examined  fcrm=trix 
multiplication,  partitioned  matrix  inversion,  computation  of 
crossambiguity  functions,  formation  of  cuter  products  and  skewed 
outer  products,  and  multiplication  of  arbitrary  matrices  by  Hankel 
and  Tceplitz  matrices.  Implementation  of  the  proneralized  singular 
value  decomposition  of  Van  Loan  has  been  identified  as  an  important 
task  for  high  resolution  direction  finding. 
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Previous  systolic  architectures  have  been  modified  and  extended  to 
provide  improved  modularity  with  respect  to  the  set  of  functions  to 
be  performed  and  the  size  of  the  data  arrays  to  be  processed, 
fpecifically,  an  improved  systolic  architecture  called  the 
"Engagement  Processor",  a  t.YFe  wavefront  processor,  has  been 

inven  ted . 

In  addition,  a  microprocessor  based  °  x  P  systolic  array  is 

being  built  at  NOFC  to  serve  as  a  test  bed  for  future  algorithm 

developnenis ,  architecture  designs  (including  systolic,  wavefront, 

and  engagement  processor  array)  and  VLPT  inpl emen  in  lien . 


2.3  VLSI  Implementation  (Hughesfl9]) 

/■  Hardware  Implementations  of  the  Tceplitz  System  Solver  and  ether 
related  systems  e.g.  for  Fadeev  algorithm,  in  VLST  was  undertaken  at 
Hughes  (HEL(IP)) 

Our  work  has  included  an  extensive  investigation  of  the  various 
possible  hardware  implementations  (fixed  point  versus  flcatinp  point, 
serial/parallel  versus  parallel  only,  etc.)  and  arithmetic 
algori ihms .  Chip  organization,  pin-cut  problems,  cell  designs,  and 
chip-tc-chip  ccmmunica liens  are  also  considered. 

There  are  several  important  design  ar.d  architectural 
considerations  that  make  our  approach  well  suited  to  the  capabilities 
of  VLSI  technologies.  These  are  summarized  below: 


Identical  processing  elementi 
Local  communication 

Expandability,  allowing  chips  to  function  on  arbitrary- 
sized  kernals 

Local  data  storage 

exploitation  of  full  concurrency  (minimum  number  of  idle 
processors) . 


3.  SJHM/Hf 


In  conclusion,  in  order  to  keep  pace  with  the  rapid  advance  in  ihe 
VLSI  technology,  the  signal  processing  cckmunity  should  net  only  look 
into  advanced  processing  theory  and  parallel  processing  methods,  but 
also  exercise  a  timely  influence  on  architectural  design  of  future 
VLSI  computing  structures.  This  is  exactly  the  goal  of  cur  OKP-FEO  IT 
project.  Keeping  up  the  current  momentum  ,  our  joint  effort  will 
definitely  represent  a  mrjcr  contribution  to  the  modern  VLFT  Fignal 
processing  technology. 


I 
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4 .  Summary  of  report  -  UCSD 


MPL-U-9/82 


ADAPTIVE  LEAST  SQUARES  LATTICE  STRUCTUR'D  AND  THEIR 
APPLICATIONS  TO  PROFLEKS  IN  UNDERWATER  ACOUSTICS 


Progress  Report 

1  Septecber  1981  -  28  February  1982 
W.  S.  Hodgkiss 


I.  BACKGROUND 

Numerous  applications  exist  which  require  a  linear  filtering 
operation.  Often,  the  nature  of  that  filtering  task  (e.g.  spectral 
shaping  characteristics)  is  time  varying  in  some  nondeterministic 
fashion  due  to  nonstationarity  of  the  underlying  time  series.  In  such 
situations,  a  filter  which  can  adapt  to  a  changing  environment  is 
needed.  For  the  purpose  of  derivation,  the  adaptive  filter  must  have 
both  a  well-defined  goal  (e.g.  linear  prediction)  and  a  well-defined 
performance  measure  (e.g.  weighted  summation  of  the  squared  prediction 
error  residuals).  Once  derived,  it  is  extremely  important  to  under¬ 
stand  the  performance  characteristics  of  the  adaptive  filter  both  from 
a  theoretical,  as  well  as  a  practical  point  of  view  prior  to  its  use 
in  the  processing  or  real  data. 


II  PROGRESS:  1  Septemer  1981  -  28  February  1982 

The  gradient  methods  of  adaptive  filter  implementation  require 
various  adaptive  power  estimates  to  be  made.  The  performance  sensi¬ 
tivity  of  the  gradient  methods  to  different  time  constants  of  the 
adaptive  power  estimation  loops  has  been  under  study.  For  the 
specific  case  of  underlying  frequency  versus  time  dynamics  consisting 
of  dual  steps,  the  simulations  presented  in  [1]  and  [2]  investigate 
this  sensitivity  in  the  context  of  a  frequency  tracking  problem. 

In  addition  to  these  performance  studies,  a  substantial  amount  of 
time  has  been  devoted  towards  implementing  software  for  the  calcula¬ 
tion  of  both  the  eigenvalues  and  singular  values  of  the  autocorrela¬ 
tion  matrix  arising  from  complex  data.  Adaptive  lattice  structures 
appear  to  show  significant  performance  advantages  over  their  direct 
form  counterparts  due  to  an  insensitivity  to  eigenvalue  spread.  Our 
desire  is  to  be  able  to  document  the  eigenvalue  spread  present  both  in 
future  controlled  simulations,  as  well  as  in  actual  sonar  data  which 
we  will  be  processing. 
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5.  Summary  of  Report  -  HRL 

Concurrent  VLSI  Architectures  for  Matrix  Operations  and 
Linear  Systems  Solution!  J . 0 .  Nash,  G.R.  Nudd,  S.  Hansen) 

Tt._  goal  of  this  project  is  to  develop  novel,  special  purpose  computing 
structures  capable  of  throughputs  far  exceeding  that  available  from  present 
day  commercial  computers.  To  achieve  this  goal  we  are  exploiting  the  con¬ 
currency  potentially  available  in  algorithms  associated  with  a  wide  range  of 
applications  (automated  production  techniques  including  image  analysis, 
robotics  control  and  solution  of  previously  intractable  simulation  problems). 

The  availability  of  Very  Large  Scale  Integration  (VLSI)  has  made  this  approach 
economically  feasible  and  at  the  same  time  has  considerably  enhanced  circuit 
performance. 

Matrix  operations  represent  a  major  part  of  the  computational  requirement 
encountered  in  many  computer  applications.  Examples  of  matrix  operations  are 
multiplication,  inversion,  and  L-U  decomposition.  In  signal  processing  such 
operations  can  be  found  in  adaptive  filtering,  data  compression,  cross-ambiguity 
calculations,  and  bcamf orming.  In  robotics  a  matrix  formulation  of  the  control 
of  manipulator  joints  in  a  Cartesian  coordinate  system  is  the  most  straight¬ 
forward  and  convenient.  Image  analysis  and  restoration  is  often  based  on 
'kernal"  operations  over  a  relatively  small  window  of  pixels,  e.g.,  30  x  30, 
as  might  be  found  in  relaxation  techniques.  Therefore,  specification  of  image 
processing  algorithms  in  terms  of  matrix  operations  occurs  in  a  natural  way,. 

Matrix  operations  are  well-suited  to  concurrent  implementations  in  which  a 
number  of  small,  identical  processors  operate  simul taneoulsy  on  the  matrix 
elements.  Thus,  a  set  of  "matrix  operator"  chips,  made  using  VLSI,  would,  when 
coupled  to  a  general  purpose  host  computer,  provide  both  a  high  computational 
throughput  and  the  flexibility  to  perform  a  wide  range  of  algorithms  spanning 
many  applications.  This  is  a  far  more  efficient  approach  than  mapping  a 
particular  algorithm  directly  into  hardware. 

The  increase  in  throughput  obtainable  by  using  an  array  of  processors 

operating  concurrently  is  proportional  to  the  number  of  processors.  Thus,  a 

3 

matrix  inversion,  which  takes  0(n  )  steps  can  be  performed  in  0(n)  time  steps 
using  an  n  x  n  array  of  processors.  The  speedup  in  time  is  then  a  factor  of  n~. 


2] 


As  an  example  of  the  utility  of  the  matrix  approach,  consider  the  requirements 
for  control  of  servoing  typical  manipulators  with  approximately  1  meter  of 
reach.  The  most  time  consuming  part  of  this  computation  is  the  calculation  of 
the  pseudo  inverse  of  the  Jacobian.  A  matrix  formulation  of  the  control  problem 
would  require  approximately  2,000  multiplications  in  250  microseconds.1  Assum¬ 
ing  that  one  multiplication  could  be  performed  every  microsecond,  a  serial 
processor  would  take  2  milliseconds;  however,  a  6  x  6  array  of  processors,  each 
with  the  same  1  microsecond  multiplication  time,  would  complete  the  calculation 
in  approximately  55  microseconds. 

The  processor  arrays  we  are  investigating  incorporate  important  design  and 
architectural  considerations  that  make  our  approach  well  suited  to  the  capabili¬ 
ties  of  VLSI  technologies.  These  are: 

•  Identical  processing  elements 

•  Local  communication 

•  Expandability,  allowing  chips  to  function  on  arbi trary-si2ed  kernals 

•  Local  data  storage 

•  Exploitation  of  full  concurrency  (minimum  number  of  idle  processors). 

Each  processing  element  consists  of  a  basic  inner-product  ("ax+b')  calc  .l'.tor 
and  communicates  only  with  its  nearest  North,  South,  West,  East,  and  d;‘  e„;ona  ’ 
neighbor. 

We  will  describe  in  this  report  two  basic  systems  which  we  have  been 
investigating.  The  Toeplitz  linear  system  solver,  first  suggested  by  S.Y.  Kung," 
is  based  on  the  Weiner-Levinson  algorithm  and  is  suitable  for  operations  on  a 
Toepl itz-type  matrix  (elements  along  any  diagonal  are  identical).  Toeplitz 
matrices  appear  in  stationary  signal  processing  applications.  For  example,  the 
autocorrelation  matrix  is  often  of  this  type.  The  second  system  >’e  are 
investigating  is  based  on  the  Faddeev  algorithm  and  is  suited  for  general 
matrices  where  there  is  no  overriding  symmetry.  In  Section  2  we  describe  these 
algorithms  and  their  functional  architectural  embodiments. 
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In  Section  3  we  report  on  numerous  detailed  rumerical  studies  we  have 
performed.  These  statistical  studies  were  done  by  simulating  the  various 
architectures  using  the  high  level  programming  language,  APL,  which  has  the 
capability  of  operating  on  arrays  of  numbers  in  a  straightforward  and  intuitive 
way.  The  simulations  have  served  three  main  functions.  First,  they  have 
provided  a  means  for  verifying  the  correctness  of  the  data  flow  within  the 
Toeplitz  and  Faddeev  matrix  processors.  Second,  they  have  allowed  us  to  study 
the  numerical  stability  of  the  algorithms  for  representative  signal  processing 
data.  And  third,  they  have  been  useful  in  predicting  effects  of  finite  register 
lengths,  roundoff  techniques,  and  pivoting  schemes. 

In  Section  4  we  describe  the  hardware  implementation  of  the  Toeplitz 
Linear  System  Solver.  This  required  an  extensive  investigation  of  the  various 
possible  hardware  implementations  (fixed  point  versus  floating  point,  serial/ 
parallel  versus  parallel  only,  etc.)  and  arithmetic  algorithms.  Chip  organiza¬ 
tion,  pin-out  problems,  cell  designs,  and  chip-to-chip  communications  are  also 
discussed  and  the  final  chip  layout  shown. 

/ 


/ 
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6.  Summary  of  Report  -  NOSC 

Signal  Processing  With  Systolic  Arrays 
J.  M,  Speiser  and  H.  J.  Whitehouse 

Naval  Ocean  Systems  Center 
San  Diego,  CA.  92152 

OBJECTIVE:  To  devise  parallel  computing  architectures  for  present  and  future 
Navy  needs  in  real-time  signal  processing,  with  emphasis  on  spectrum  analysis 
and  direction  finding. 

APPROACH:  Select  the  needed  signal  processing  operations  which  present  a  large 
computational  load,  and  determine  corresponding  computational  algorithms  which 
exhibit  parallelism  and  regularity  of  data  flow.  Structure  such  algorithms  as 
systolic  architectures  and  examine  the  required  arithmetic  operations  and  I/O 
between  successive  algorithms  or  systolic  subsystems. 

PROGRESS:  (through  Feb  82)  systolic  architectures  have  been  developed  for  signal 
processing  tasks  and  systolic  implementations  have  been  examined  in  coordination 
with  the  NOSC-university-industry  SRO  task  team  on  spectrum  analysis:  partitioned 
matrix  multiplication,  partitioned  matrix  inversion,  computation  of  crossambiguity 
functions,  formation  of  outer  products  and  skewed  outer  products,  and  multiplica¬ 
tion  of  arbitrary  matrices  by  Hankel  and  Toeplitz  matrices.  Implementation  of  the 
generalized  singular  value  decomposition  of  Van  Loan  has  been  identified  as  an 
important  task  for  high  resolution  direction  finding.  Two  sessions  on  parallel 
processing  algorithms  and  architectures  were  organized  for  the  SPIE  International 
Symposium  in  San  Diego,  August,  1981. 

Previous  systolic  architectures  have  been  modified  and  extended  to  provide  improved 
modularity  with  respect  to  the  set  of  functions  to  be  performed  and  the  size  of  the 
data  arrays  to  be  processed.  Specifically,  an  improved  systolic  architecture 
called  the  "Engagement  Processor",  a  type  of  wavefront  processor,  has  been  invented 
and  has  the  following  advantageous  attributes: 

a)  Permits  more  efficient  multiplication  of  dense  matrices  and  explicitly 
provides  for  partitioned  multiplication  of  matrices  having  more  elements  than 
the  number  of  processors  in  the  array.  Efficiency  is  approximately  1/3  for  a 
single  matrix  multplication  of  matrices  which  match  the  processor  in  size,  and 
nearly  1  for  the  pipelined  multiplication  of  successive  matrices,  including  the 
partitioned  multiplication  of  large  matrices. 

b)  Can  include  the  modified  hexagonal  array  of  H.  T.  Kung  as  a  subset 
permitting  L-U  decomposition  of  matrices  using  the  same  processor. 

c)  Permits  efficient  utilization  of  the  processors  for  covariance 
estimation  for  array  data. 

d)  Permits  calculation  of  an  N  point  DFT  in  about  11  N.5  arithmetic 
cycles  using  N  processors,  including  I/O  time  for  multiplexed  I/O. 

For  matrix  multiplication,  covariance  estimation,  and  DFT  calculation,  the 
engagement  processor  may  be  viewed  as  a  rectangular  systolic  array  with  two  types 
of  memory  augmentation:  local  memories  for  the  inner  product  step  processors, 
permitting  parallel  access  to  stored  arrays,  and  two  sets  of  virtual  delay  lines, 
acting  as  I/O  buffers  at  two  edges  of  the  array.  The  same  processor  array  may 
be  used  to  perform  L-U  decomposition  of  matrices  with  only  minor  modification 


because  of  the  known  result  of  S.  Y.  Kung  that  a  hexagonal  systolic  array 
may  be  viewed  as  a  rectangular  array  with  diagonal  interconnects. 

The  engagement  processor  systolic  architecture  has  been  extended  to  provide 
faster  matrix  multiplication  with  only  a  nominal  increase  in  hardware  complexity. 

One  extension  uses  a  single  "bus  expander"  to  permit  multiplication  of  an  arbi¬ 
trary  real  N  by  N  matrix  with  a  Toeplitz  of  Hankel  matrix  with  only  vector 
storage.  The  signal  processing  operations  of  outer  product  formation  triple 
product  convolution  and  skewed  outer  product  formation  were  also  examined  for 
timing  on  an  engagement  processor.  Combinations  of  these  and  the  previously 
described  operations  were  examined  in  order  to  compare  four  algorithms  for 
crossambiguity  function  calculation  on  a  systolic  processor  -  the  algorithms 
previously  examined  for  optical  processor  implementation:  a  two-dimensional 
Fourier  transform  technique,  the  combination  of  a  skewed  outer  product  and  one¬ 
dimensional  DFT  (as  used  in  a  -r-slice  processor),  the  combination  of  pointwise 
multiplication  and  Hankel  matrix  multiplication  (as  used  in  a  space-integrating 
processor),  and  a  simulated  triple  product  convolution  (as  used  in  a  time-inte¬ 
grating  processor).  The  last  two  techniques  were  significantly  faster  on  an 
engagement  processor  than  the  first  two:  time  4N  versus  13N  and  7N.  Also,  a 
technique  was  devised  to  combine  toroidal  interconnection  of  an  engagement 
processor  with  a  set  of  N  bus  expanders  to  permit  the  multiplication  of  arbitrary 
N  by  N  matrices  in  time  N  using  N2  processors.  This  corresponds  to  100%  efficient 
use  of  the  processors  without  requiring  pipelining  of  successive  matrix  multipli¬ 
cations,  but  requires  non-nearest  neighbor  interconnections. 

We  are  currently  examining  parallel  algorithms  and  systolic  architectures  for 
orthogonal-triangular  factorization  of  matrices  for  least  squares  solution  and 
eigensystems  via  the  QR  algorithm,  as  well  as  constrained  least  squares  solution 
and  recursive  updating  of  least  squares  solutions.  Current  bottlenecks  appear 
to  be  accumulation  of  the  orthogonal  matrix,  incorporation  of  shifts,  and  the 
incorporation  of  pivoting  in  a  partitioned  QR  decomposition. 

We  have  observed  that  in  most  signal  processing  appli cations whgre  the  eigensystem 
of  a  covariance  matrix  is  desired,  the  matrix  is  estimated  as  ft  =  AT  A,  where  A 
is  experimentally  observed.  It  is  therefore  numerically  desirable  to  find  the 
eigensystem  of  ft  from  the  singular  value  decomposition  of  A  rather  than  computing 
directly  with  ft.  If  the  SVD  of  A  is  A  =  PDQT,  where  P  and  Q  are  orthogonal  matrices 
and  D  is  diagonal,  then  ft  =  A^A  =  Q*  D2  Q,  so  the  eigenvalues  of  ft  may  be  computed 
as  the  squares  of  the  singular  values  of  A,  and  the  eigenvectors  of  ft  are  the 
right  singular  vectors  of  A.  This  suggestion  has  been  applied  by  the  Marine  Physical 
Laboratory  to  the  estimation  of  the  spectrum  of  a  multiple  tone  signal.  Using  PDP-11 
floating  point  arithmetic,  the  computational  dynamic  range  was  increased  by  approxi¬ 
mately  60  dB  when  the  eigenvalues  of  R  were  computed  as  the  squares  of  the  singular 
values  of  A.  We  are  currently  examining  the  applicability  of  generalized  eigensys- 
tens  to  Navy  direction  finding  problems.  We  propose  to  extend  this  approach  to  the 
generalized  eigensystem  computation,  by  using  the  generalized  singular  value  decom¬ 
position  of  Van  Loan. 

As  alternatives  to  the  QR  algorithm  for  the  eigensystem  problem  and  the  Golub 
adaptation  of  QR  to  the  singular  value  decomposition,  we  are  examining  Jacobi  methods 
for  the  eigensystem  problem  and  the  one-sided  orthogonal izati on  adaptions  of  the 
Jacobi  method  for  the  singular  value  decomposition.  Although  the  Jacobi  and  related 
methods  require  approximately  three  times  the  number  of  multiplication  of  the 
corresponding  QR  methods,  the  Jacobi  methods  have  two  strong  advantages:  a)  easier 
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parallelization;  b)  in  many  signal  processing  applications  a  great  speedup  is 
possible  by  applying  a  preconditioning  transformation. 

For  problems  requiring  the  eigensys terns  of  the  covariance  matrix  of  a  wi de¬ 
sense  stationary  random  process,  or  the  corresponding  singular  value  decomposition 
of  A,  where  R  =  A'A,  the  basis  vectors  of  the  discrete  Fourier  transform  are  an 
approximate  eigensystem  for  ft,  and  may  be  used  to  approximately  diagonalize  ft  by 
a  unitary  similarity  transformation,  or  to  perform  an  approximate  one-sided 
orthogonal izati on  on  A.  The  corresponding  Jacobi  iterations  or  one-sided  rotations 
may  be  used  to  rapidly  improve  the  decomposition. 

Publications  during  this  period  have  included  a  tutorial  on  recent  parallel 
architectures  [10],  an  exposition  of  the  application  of  current  systolic  archi¬ 
tectures  to  sonar  problems  [11],  and  a  description  of  the  NOSC  systolic  testbed 
for  architecture  validation  and  algorithm  development  [12]. 

In  order  to  facilitate  the  development  of  algorithms  and  software  for  wavefront 
processors,  NOSC  will  provide  an  8  x  8  programmable  systolic  arrayL'2]  to  USC. 
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SECTION  1 
PROJECT  STATUS 


This  report  describes  the  research  activities  of  Integrated 
Systems,  Inc.  (ISI)  on  contract  number  N00014-81-X-0191  for  the 
period  from  1  October  1981  to  1  January  1982.  The  research 
activities  have  focussed  on  five  major  tasks  (see  Figure  1). 
Project  spending  is  indicated  in  Figure  2.  The  proposed  research 
on  nonlinear  signal  processing  algorithms  and  architectures  is 
being  performed  on  schedule  and  within  budget.  A  summary  of 
the  technical  work  is  described  in  Section  1.1  -  1.5  and  in  the 
attached  technical  memos. 

1.1  LADDER-FORM  ALGORITHMS 

A  recursive  least-squares  ladder-form  algorithm  for  predicted 
residuals  rather  than  filtered  residuals  has  been, derived.  The 
reflection  coefficients  and  the  order  updates  of  the  residuals  in 
the  new  algorithm  are  computed  simulateously .  This  formulation 
improves  the  throughput  rate  and  numerical  stability  of  existing 
recursive  least -squares  algorithms.  The  details  of  the  newly 
developed  ladder-form  algorithm  are  discussed  in  ISI  Technical 
Memo  5016-03. 

1.2  PARALLEL  KALMAN  FILTER  ALGORITHMS 

The  research  on  Parallel  Kalman  Filters  (PKFs)  has  been 
divided  into, two  major  tasks:  PKFs  for  linear  signal  processing 
applications  and  PKFs  for  nonlinear  signal  processing. 
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FIGURE  2.  HIGHLY  PARALLEL  MODERN  SIGNAL  PROCESSING  SCHEDULE 


Figure  2  First  Year  Project  Spending 
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1.2.1  PKFs  for  Linear  Signal  Proc 


essin* 


>1  ications 


For  linear  signal  processing  applications,  the  predictor  and 
corrector  equations  in  the  Kalman  filter  can  be  computed  on  sepa¬ 
rate  processors  simultaneously.  A  PKF  has  been  coded  and  simu¬ 
lated.  A  stability  analysis  of  the  PKF  is  currently  underway. 
Systolic  array  architectures  for  implementing  the  PKF  with  VLSI 
technology  have  been  investigated  (see  Section  1.5). 


1.2.2  PKFs  for  Nonlinear  Sii 


Processini 


)pl ications 


For  nonlinear  signal  processing  applications  (such  as  spec¬ 
trum  estimation),  it  is  difficult  to  decouple  the  extended  Kalman 
filter  (EKF)  equations.  Therefore,  to  speed-up  computations 

parallel  predictor-corrector  (P  C)  methods  have  been  used  to 

2 

speed-up  the  linearization  process  in  the  EKF.  The  P  C  methods 

are  2  -  100  times  faster  than  sequential  methods  given  2  -  100 

processors.  Research  on  methods  for  varying  the  integration 

2 

step  size  has  continued  to  make  the  P  C  methods  more  efficient. 

The  parallel  EKF  has  been  applied  to  maximum  likelihood 
estimation  of  sinusoidal  signals  -in  wide-band  noise.  Square-Root- 
Free  Parallel  Quasi-Newton  methods  have  been  used  to  improve  the 
maximum  likelihood  parameter  estimates  (see  ISI  Technical  Memo 
5016-04). 


1.3  SCATTERING  THEORY 

This  task  has  just  been  started  (see  Figure  1).  Our  hope 
is  to  use  scattering  theory  to  decouple  large-scale,  2-D  signal 
processing  problems  into  blocks  which  can  be  processed  simul¬ 
taneously.  Scattering  theory  will  also  provide  physical  insight 
into  the  implementation  of  ladder  forms  and  parallel  Kalman 
filters. 
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1.4  SPECTRAL  ESTIMATION 

A  prediction  error  approach  based  on  auto-regressive-moving- 
average  (ARMA)  models  and  a  maximum  likelihood  (ML)  method  which 
utilizes  the  PKF  developed  under  Task  2  have  been  coded  and 
applied  to  power  spectrum  estimation  (see  ISI  Technical  Memos 
5016-04  and  5016-05).  The  results  indicated  that:  (1)  ARMA 
based  methods  gave  sharp  peaks  compared  with  auto-regressive  (AR) 
techniques  and  (2)  the  PKF  gave  excellent  ML  estimates  of 
sinusoidal  parameters  even  under  poor  signal-to-noise-ratio 
conditions.  Therefore,  research  on  ARMA  and  ML  based  spectral 
estimation  will  continue. 

1.5  ARCHITECTURES  FOR  LADDER  FORMS  AND  PARALLEL  KALMAN  FILTERS 

Ladder-form  architectures  for  implementing  the  recursive 
least-squares  ladder-form  algorithm  developed  under  Task  1  are 
described  in  ISI  Technical  Memo  5016-03. 

Existing  systolic  array  architectures  are  being  evaluated 
to  determine  which  architectures  are  suitable  for  implementing 
parallel  Kalman  filters  (PKFs).  Systolic  array  architectures 
for  Cholesky  decomposition  and  triangularization  have  been 
considered  to  implement  square-root  PKFs.  In  addition,  VLSI 
architectures  based  on  mapping  the  Kalman  filter  equations 
directly  onto  the  processor  architectures  have  been  examined. 

The  results  of  this  study  will  be  included  in  the  final  report. 
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SECTION  2 


FUTURE  WORK 


Future  research  will  be  conducted  in  accordance  with  the 
program  schedule  (see  Figure  1).  Research  will  continue  on 
ladder  forms,  parallel  Kalman  filters,  and  spectral  estimation. 
Emphasis  will  be  placed  on:  (1)  parallel  architectures  which 
are  suitable  for  implementation  with  VLSI/VHSIC  technology,  (2) 
stability  analysis  of  the  PKF  algorithms,  (3)  extending  the  PKF 
algorithms  and  architectures  for  2-D  signal  processing  applica¬ 
tions  (e.g.,  image  processing),  and  (4)  integrating  the  research 
activities  of  the  SRO  participants  (e.g.,  using  the  data  flow 
language  to  emulate  the  PKF  architectures).  Simulation  studies 
will  continue  to  evaluate  the  performance  of  the  newly  developed 
parallel  algorithms  and  architectures  for  spectral  estimation. 
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SECTION  3 

TECHNICAL  PUBLICATIONS 


The  following  technical  papers  have  been  written  under 
contract  number  N00014-81 -K-0191 : 


1.  Travassos,  R.  H.,  "Parallel  Processing  Algorithms 
for  Unconstrained  Minimization,"  prepared  for  the 
Journal  of  Optimization  Theory  and  Application , 

July  1981. 

2.  Travassos,  R.  H. ,  "Square-Root  Parallel  Quasi-Newton 
Methods  for  Nonlinear  Optimization,"  invited  paper 
presented  in  the  special  session  on  parallel  opti¬ 
mization  techniques,  Optimization  Days,  Montreal, 
Canada,  May  1981. 

3.  Travassos,  R.  H.f  "Parallel  Processing  Algorithms 
for  System  Parameter  Identification,"  prepared  for 
the  IFAC  Symposium  on  System  Identification, 

Washington,  D.  C.  ,  June  1982. 

4.  Travassos,  R.  H. ,  "Parallel  Kalman  Filtering,"  IS1 
Technical  Memo  5016-01,  October  1981. 

5.  Jover,  J.  M.  and  Travassos,  R.  H.  ,  "Derivation  of 
Algorithms  for  UDU^  Kalman  Filters,"  ISI  Technical 
Memo  5016-02,  October  1981. 

6.  Reddy,  V.  U. ,  "A  Numerically  Stable  and  High-Speed 
Recursive  Least-Sqm  res  Ladder-Form  Algorithm," 

ISI  Technical  Memo  5016-03,  November  1981. 

7.  Travassos,  R.  H. ,  "K aximum  Likelihood  Estimation  of 
Sinusoidal  Signals  in  Wide-Band  Noise,”  ISI  Techni- 
■al  Memo  5016-04,  January  1982. 

8.  ]  -ddy,  V.  U. ,  Travassos,  R.  H. ,  and  Kailath,  T.,  "A 

l  .imparison  of  <onlin=ar  Spectral  Estimation  Techniques," 
131  Technical  Memo  5116-05,  January  1982. 

9.  Travassos,  R.  H.  and  Andrews,  A.,  "VLSI  Implementation 
of  Parallel  Kalman  Filters,”  prepared  for  the  AIAA 
Guidance  and  Control  Conference,  San  Diego,  California, 
August  1982. 
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1  PARALLEL  COMPUTING  STHJCTORES  (S.Y.Kung,  Y.H.Hu,  R.Gal-ezer,  K.S.Arun 


,D.V.B.  Rao) 

With  the  rapidly  growing  microelectronics  technology  leading  the  way, 
modem  signal  processor  architectures  are  undergoing  a  major  revolution. 
The  availability  of  low  cost,  fast  VLSI  (Very  Large  Scale  Integration) 
devices  premises  the  practice  of  cost-effective,  high  speed,  parallel 
processing  of  large  volume  of  data.  This  makes  possible  ultra  high 
thro ughput- rate  and  therefore,  designates  a  major  technological 
breakthrough  for  real-time  signal  processing  applications.  On  the  ether 
hand,  it  has  become  mere  critical  than  ever  to  gain  a  fundamental 
understanding  of  the  algorithm  structure,  architecture,  and 
implementation  constraints  in  order  to  realize  the  full  potential  of 
VLSI  computing  power.  In  cur  work,  the  two  most  critical  issues 
-  parallel  computing  algorithm  and  VLSI  architectural  constraint  will  be 
considered  : 

1.  To  structure  the  algorithm  to  achieve  the  maximum  parallelism 
and,  therefore,  the  maximum  throughput- rate. 

2.  To  cope  with  the  ccmmuni cation  constraint  sc  as  to  compromise 
least  in  processing  throughput- rate. 

1  . 1  A  highly  concurrent  Toeplitz  system  solver  T5-6] 

Eased  on  the  above  considerations,  we  have  developed  a  highly 
concurrent  Toeplitz  system  solver,  featuring  maximum  parallelism  and 
localized  communication. 

Toeplitz  systems  arise  in  numerous,  wide-spread  applications  ranging 
from  speech,  image,  neurophysics,  to  radar,  sonar,  geophysics,  and 
astronomical  signal  processing  .  Our  contribution  lies  in  the 
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development  of  a  highly  concurrent  algorithm  and  pipelined  architecture 
which  is  able  to  solve  a  Tceplitz  system  in  0(ll)  processing  time  in  an 
array  proecssor,  as  opposed  to  0(N**3)  for  general  (sequential)  Gauss 
elimination  procedure  or  0(N**2)  for  (sequential)  Levinson  algorithm. 

For  parallel  consideration,  we  note  that  the  Levinson  procedure  has 
to  call  upon  an  inner  product  operation  to  compute  the  vital  "reflection 
coef fecients" .  Even  when  N  processors  is  utilized,  an  inner  product 
operation  will  need  at  least  lcgN  units  of  time.  This  will  amount  to  a 
total  of  0(  NlogN  )  units  of  computing  time  for  the  entire  Levinson 
procedure.  This  is  of  course  unsatisfactory  since  the  processors  are  net 
effectively  utilized. 

In  order  to  achieve  full  parallelism,  we  have  to  further  exploit  the 
Tceplitz  structure.  For  this  purpose,  we  have  proposed  a  new,  pipelined 
version  of  the  Levinson  algorithm  which  allows  the  reflection 
coefficients  to  be  computed  in  a  pipelined  fashion.  This  avoids  the  need 
of  the  inner  product  operations,  and  the  total  computing  time  is 
therefore  reduced  to  0(N). 

This  new  algorithm  can  be  implemented  in  full  compliance  with  the 
VLSI  communication  constraint.  More  precisely,  a  pipelined  processor 
architecture  is  developed  which  usee  oni,,  calized  interconnections  and 
still  retains  the  maximum  parallelism  attainable. 

In  summary,  we  believe  that  the  proposed  pipelined  Toeplits  system 
solver  is  perhaps  the  most  efficient,  fast,  and  practical  (in  VLSI 
sense)  design  available  for  solving  Toeplitz  systems.  Moreover,  the 
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design  methodology  demonstrated  in  this  work  should  also  help  answer 
some  fundamental  problems  faced  in  designing  of  VLSI  parallel  processor 
archi tec  tures. 

1.2  Toeplitz  Eigenvalue  Computation  [36] 

This  research  task  deals  with  the  parallel  computation  of  the  minimum 
eigenvalue  of  a  Toeplitz  matrix.  The  minimum  eigenvalue  has  an 
important  interpretation  as  the  power  of  additive,  white  noise  to  be 
determined  in  a  noisy  statistical  environment.  In  many  high  resoluT'on 
spectrum  analysis  problems,  the  estimation  and  removal  of  such  noise 
contribution  is  essential  for  unbiased  estimates.  Our  objective  is 
again  to  derive  an  0(N)  computation  algorithm  to  estimate  the  minimum 
eigenvalue  of  a  given  Toeplitz  covariance  matrix.  This  goal  can  be 
accomplished  by  adopting  the  pipelined  Toeplitz  computing  structure 
discussed  earlier  and  a  careful  utilization  of  a  relationship  between 
the  minimum  eigenvalue  and  the  redisues  E  that  arise  in 

the  Levinson  algorithm.  Based  on  this  relationship,  a  fast  iterative 
procedure  is  developed  to  successively  estimate  the  minimum  eigenvalue. 
Based  on  simulation  results  for  such  an  application,  seme  improvements 
are  observed  in  both  the  computing  speed  as  well  as  accuracy  of 
estimates.  Although  much  more  computational  complexity  analysis  is  yet 
to  be  demonstrated,  we  are  convinced  that  this  approach  will  have  a 
major  in  future  applications  of  high  speed,  high  resolution  spectrum 
estimation  problems. 
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1 .3  Wavefront  Array  Processor 

The  traditional  design  of  parallel  computers  and  languages  usually 
suffers  from  heavy  supervisory  overhead  incurred  by  synchronization, 
communication,  and  scheduling  tasks,  which  severely  hamper  the 
throughput  rate  which  is  critical  to  real-time  signal  processing. 

Furthermore,  additional  restrictions  imposed  by  VLSI  will  render  the 
general  purpose  array  processor  very  inefficient.  We  therefore  restrict 
ourselves  to  a  special  class  of  applications,  i.e.  recursive  and  local 
data  dependent  algorithms,  to  conform  with  the  constraints  imposed  by 
VLSI.  however,  this  restriction  incurs  little  loss  of  generality,  as  a 
great  majority  of  signal  processing  algorithms  possess  these  properties. 
One  typical  example  is  a  class  of  matrix  algorithms. 

Very  significantly,  these  algorithms  involve  repeated  application  of 
relatively  simple  operations  with  regular  localized  data  flew  in  a 
homogeneous  computing  network.  This  leads  to  an  important  notion  of 
computational  wavefront,  which  portrays  the  computation  activities  in  a 
manner  resembling  a  wave  propagation  phenomenon.  More  precisely,  the 
recursive  nature  of  the  algorithm,  in  conjunction  with  the  localized 
data  dependency,  points  to  a  continuously  advancing  wave  of  data  and 
computational  activity. 

The  wavefront  concept,  provides  a  firm  theoretical  foundation  for  the 
design  of  highly  parallel  array  processors  and  concurrent  languages. 
Moreover,  uhis  concept  appears  to  have  some  distinct  advantages. 

Firstly,  the  wavefront  notion  drastically  reduces  the  complexity  in 
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the  description  of  parallel  algorithms.  The  mechanism  provided  for  this 
description  is  a  special  purpose,  wavefront-oriented  language.  Rather 
than  requiring  a  program  for  each  processor  in  the  array,  this  language 
allows  the  programmer  to  address  an  entire  front  cf  processors. 

Secondly,  the  wavefront  notion  leads  to  a  wavefront-  based 

architecture  that  conforms  with  the  constraints  of  VLSI,  and  supports  a 
major  class  of  signal  processing  algorithms.  As  a  consequence  of 
Huygen's  principle,  wavefronts  should  never  intersect.  With  a  wavefront 
architecture  that  provides  asynchronous  waiting  capability,  this 

principle  is  preserved.  Therefore,  the  wavefront  approach  can  cope  with 
timing  uncertainties,  such  as  local  clocking,  random  delay  in 

communications  and  fluctuations  cf  computing- times.  In  short,  there  is 
no  need  for  global  synchrcni za tion. 

Thirdly,  the  wavefront  notion  is  applicable  to  all  VLSI  signal 
processing  algorithms  that  possess  locality  and  recursivity,  and  hence, 
has  numerous  applications. 

The  integration  of  the  wavefront  concept,  the  wavefront  language  and 
the  wavefront  architecture  leads  to  a  programmable  computing  network, 
which  we  will  call  the  WAVEFRONT  ARRAY  PROCESSOR  (WAP ). The  WAP  is, in  a 
sen8e,an  optimal  tradeoff  between  the  globally  synchronized  and 
dedicated  systolic  array  (that  works  on  a  similar  set  of  algorithms), 
and  the  general-purpose  data-flow  multiprocessors.  It  provides  a 
powerful  tool  for  the  high  speed  execution  of  a  large  class  cf 
algorithms  which  have  widespread  applications.  The  applications  are 
very  broad  including  PDE  solver,  SVD,  linear  systems  solvers,  sorting 


5 


and  searching  routines. 
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There  exist  two  approaches  approaches  to  programming  the  WAP:  a 
local  approach,  describing  the  actions  of  each  processing  element,  and  a 
global  approach,  describing  the  actions  of  each  wavefront.  To  allow  the 
user  to  program  the  WAP  in  both  these  fashions,  two  versions  of  MDFL  are 
proposed  :  global  and  local  MDFL.  A  global  MDFL  program  describes  the 
algorithm  from  the  view- point  of  a  wavefront,  while  a  local  MDFL  program 
describes  the  operations  of  an  individual  processor.  Mere  precisely, 
the  perspective  of  a  global  MDFL  programmer  is  of  one  wavefront  passing 
across  all  the  processors,  while  the  perspective  of  a  local  MDFL 
programmer  is  that  of  one  processor  encountering  a  series  of  wavefronts. 

In  summary,  cur  contribution  hinges  upon  the  development  of  a 
wavefrent-based  language  and  architecture  for  a  programmable  special 
purpose  multiprocessor  array.  Based  on  the  notion  of  computational 
wavefront,  the  hardware  of  the  processor  array  is  designed  to  provide  a 
computing  medium  that  preserves  the  key  properties  of  the  wavefront.  In 
conjunction,  a  wavefront  language  (MDPL)  is  introduced  that  drastically 
reduces  the  complexity  of  the  description  of  parallel  algorithms  and 
simulates  the  wavefront  propagation  across  the  computing  netwrek. 
Together,  the  hardware  and  the  language  lead  to  a  programmable  Wavefront 
Array  Processor  (WAP).  The  WAP  blends  the  advantages  of  the  dedicated 
Slystolic  array  and  the  general  purpose  Data-Flow  machine  and  provides  a 
powerful  tool  for  the  high  speed  execution  of  a  large  class  of  matrix 
operations  and  related  algorithms  which  have  widespread  applications. 
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2  SIGNAL  PROCESSING  ALGORITHMS  AND  THEOBf  ( S.Y .Kung ,  Y.H.Hu,  D.V.B.  Rao) 

As  to  this  research  front,  it  hinges  upon  a  thorough,  in-depth 

understanding  of  mathematics  and  algorithm  analysis.  In  addition  to  the 
classical  mathematical  techniques  such  as  Fourier  transform,  linear 
dynamic  systems,  random  process,  etc.  there  arises  a  new  signal 
processing  mathematics  branch  which  can  be  grossly  termed  as  modern 
spectral  analysis.  Explicitly  or  not, • large  class  of  signal  processing 
applications  have  had  extensive  use  of  this  analysis  as  a  technical 
basis.  Therefore,  our  research  effort  aims  at  developing  a  theoretical 
and  algorithmic  basis  for  modern  spectral  analysis  methods  and  signal 
processing  applications. 

2. 1  Adaptive  Notch  filtering  (USC  [  1  —3] ) 

Using  a  steady  state  frequency  domain  approach,  a  new  method  has  been 
developed  for  the  retrieval  of  sinusoids/narrcwband  signals  in  additive 
noise  colored  or  white.  The  method  suggested  has  been  shewn  to  require 
smaller  filter  length  to  produce  unbiased  estimates,  compared  to  the 
existing  autoregressive  method.  For  its  implementation,  a  pole-zero 
filter  where  the  feedback  and  feedforward  coefficients  are  related 
(constrained  ARM A),  has  been  developed.  A  study  of  the  performance  and 
implementational  aspects  of  the  filter  have  been  undertaken.  The 
details  of  this  newly  developed  are  discussed  in  the  full  report.  For  a 
stable  implementation,  parallel  and  cascade  forms  have  been  shown  to  be 
useful.  A  parallel  processing  scheme  developed  shows  great  promise. 
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2.2  Relationships  Between  Several  Popular  Methods  for  Spectral 
Estimation  and  Array  Processing 

It  may  seem  too  ambitious  to  compare  all  currently  popular  high- 
resolution  spectral  estimation  methods.  However,  from  a  different 
perspective,  Pisarenko's  method  can  be  viewed  as  an  extension  of  the  MEM 
method  with  the  removal  of  the  noise  contribution.  Therefore,  an  attempt 
is  being  made  at  developing  a  unified  framework  for  the  spectral 
analysis  techniques.  Moreover,  the  unification  attempt  is  being  extended 
to  the  counterpart  of  spectral  analysis  in  array  processing  application, 
for  which  we  are  convinced  that  the  general  principles  remain  largely 
applicable.  We  are  currently  looking  into  theoretical  and  computational 
relevances  between  several  recent  array  processing  and  spectrum 
estimation  methods. 

2.3  Toeplitz  Approximation  Method  (USC[4]) 

Recently,  the  study  on  approximation  theory  and  its  applications  has 
received  considerable  attention.  In  our  work,  a  narrowband/sinusoidal 
signal  retrieval  problem  is  formulated  in  terms  of  approximation  of 
Toeplitz  autoccvariance  matrix.  A  Toeplitz  approximation  method  based 
on  singular  value  decomposition  is  proposed  and  simulation  results 
indicate  some  improvement  over  seme  previously  proposed  methods. 


3  REVIEW  OF  RESEARCH  ACTIVITIES  IW  IPI ,  USC  ( A.  A.  Sawchuk,  R.Chellappa) 
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3.1  Parallel  Algorithms  for  Image  Processing  and  Analysis 

Our  recent  research  at  Image  Processing  Insti  tute  ,U  SC,  has  been 
concerned  with  parallel  algorithms  for  image  processing  and  image 
analysis.  Most  of  the  effort  has  been  concerned  with  parallel 
implements tion  of  nonstationary  adaptive  image  restoration.  Recursive 
and  ncn- recursive  implementation  of  locally  adaptive  restoration  has 
been  studied.  These  techniques  estimate  the  local  nonstaticnary  mean 
and  variance  of  ideal  scenes  from  degraded  data.  Most  blurring 
degradations  are  also  highly  local,  so  that  local  parallel  processing 
combined  with  the  ncnstationary  image  model  data  can  be  used  to  minimize 
local  mean-square  errc  (MSE)  in  a  parallel  fashion.  We  have  shown  that 
local  MSE  is  net  a  bad  error  criterion  for  image  processing,  as  opposed 
to  the  usual  global  MSE  taken  over  the  entire  scene.  Global  MSE  often 
dees  net  correlate  well  with  human  observer  judgments  of  image  quality. 

We  have  locked  at  the  application  of  these  techniques  to  systems  with 
coherent  speckle  noise,  such  as  synthetic  aperature  (SAR)  imagery, 
coherent  sonar  and  acoustic  imaging.  Both  recursive  (Kalman-like)  and 
local  sectioned  parallel  implementations  are  being  studied  in  detail. 
In  addition,  we  have  began  investigations  on  parallel  feature  extraction 
for  texture  identification  and  texture  segmentation. 

3.2  Two  Dimensional  Spectral  Estimation 

Two-dimensional  spectral  estimation  is  of  interest  in  image 
restoration,  filtering  of  PAR  images  and  texture  classification.  Our 
recent  research  has  been  concerned  with  developing  systematic  methods 
for  2-D  spectral  estimation  from  raw  data  using  random  field  models.  We 
assume  that  the  given  finite  data  is  represented  by  an  appropriate 
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Gaussian  Markov  randan  field  (NRF)  model. 
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This  assumption  reduces  the  spectral  estimation  problem  to  that  of 
estimating  the  appropriate  structure  and  the  parameters  of  the  model. 
By  using  specific  finite  toroidal  lattice  representations  and  Gaussian 
maximum  likelihood  estimates  we  have  developed  new  2-D  spectral 
estimates.  It  turns  out  that  the  MRF  spectrum  is  also  the  maximum 
likelihood  spectrum  arising  in  frequency -wave  number  analysis. 
Furthermore,  the  sample  correlation  values  of  the  given  observations  in 
an  array  N  are  in  perfect  agreement  with  the  estimated  theoretical 
correlations  in  N  obtained  by  Fourier  inverting  the  MFF  spectrum. 
Thus  the  MRF  spectrum  developed  by  us  converges  to  the  2-D  maximum 
entropy  spectral  estimate  asymptotically.  Currently  we  have  begun 
investigations  on  parallel  implementation  of  the  algorithms  for  2-D 
spectral  estimation. 

In  addition,  we  are  also  investigating  the  use  of  another  class  of 
random  field  models  known  as  spatial  autoregressive  models  which  are 
white  noise  driven  non  causal  models  for  spectral  estimation. 


4  PARALLEL  PROCESSING  TECHNIQUES  FOR  SEISMIC 
PROCESSING (J.Mendel,J.Goutsiaa) 

Because  of  the  large  volume  of  information  involved  in  the  simulation 
and  processing  of  seismic  data,  and  the  amount  of  processing  required, 
parallel  techniques  have  begun  to  be  studied.  The  recent  development  of 
VLSI  systems  and  the  growing  sophestication  in  the  design  of  array 
processors  xan  lead  to  the  effecient  simulation  of  large  seismic  models. 
We  are  examining  some  possible  parallel  processing  techniques  for  the 
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computation  of  the  output  of  a  horizontally  stratified,  non  absorbtive 
medium  in  which  there  are  vertically  travelling  plane  compressions! 
waves . 

This  task  has  just  been  started  and  we  intend  to  look  at  different 


parallel  structures  for  generating  synthetic  seismograms. 
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